I explored the white wine quality data set. I removed the sequence number column from the original dataset because it carries no useful information.
## 'data.frame': 4898 obs. of 12 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
According to the summary, this data set has 4898 white wine samples. Here are some observations:
Quality ranges from 3 to 9 with a median of 6. The 1st quartile is 5 and the 3rd quartile is 6, so at least 75% of wines are rated 6 or lower and at least half are rated 5 or 6.
Fixed acidity, volatile acidity, and citric acid all have very wide ranges. For example, fixed acidity ranges from 3.8 to 14.2, and the maximum of 14.2 is almost double the 3rd quartile of 7.3. The other two acidity variables show a similar pattern, which suggests there may be outliers at the high end.
Residual sugar and chlorides show the same pattern as acidity. For example, the maximum residual sugar is 65.8 while the 3rd quartile is only 9.9, and the maximum chlorides value is 0.346 while the 3rd quartile is 0.05.
The density range is narrow, from 0.9871 to 1.0390.
pH ranges from 2.72 to 3.82.
Alcohol ranges from 8.0 to 14.2.
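One way to make the outlier suspicion above concrete is Tukey's rule, which flags values above Q3 + 1.5 * IQR. The sketch below applies it to the quartiles and maxima printed in the summary. The data frame `summary_stats` and its column names are my own illustrative names, and the numbers are typed in by hand from the summary output rather than computed from the raw data.

```r
# Quartiles and maxima copied by hand from the summary() output above.
# `summary_stats` and its column names are illustrative, not from the original analysis.
summary_stats <- data.frame(
  feature = c("fixed.acidity", "volatile.acidity", "citric.acid",
              "residual.sugar", "chlorides"),
  q1  = c(6.3, 0.21, 0.27, 1.7, 0.036),
  q3  = c(7.3, 0.32, 0.39, 9.9, 0.050),
  max = c(14.2, 1.10, 1.66, 65.8, 0.346)
)

# Tukey's upper fence: Q3 + 1.5 * IQR
summary_stats$upper_fence <- with(summary_stats, q3 + 1.5 * (q3 - q1))

# TRUE means the maximum lies beyond the fence, i.e. at least one high outlier
summary_stats$max_beyond_fence <- summary_stats$max > summary_stats$upper_fence

print(summary_stats)
```

For all five features the maximum lies beyond the upper fence, which supports trimming the high end before plotting.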
## $title
## [1] "Quality Distribution"
##
## attr(,"class")
## [1] "labels"
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Only 5 white wine samples have the best quality (9), and only 20 samples have the worst quality (3). The best wines may be rare and hard to find, and the worst ones may result from production defects, which are also uncommon. The distribution is close to a normal distribution.
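To see how concentrated the grades are, the counts in the table above can be turned into percentages. This is a small sketch that re-enters the counts by hand; the variable names are mine and are not part of the original analysis.

```r
# Counts per quality grade, copied from the table above
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

total_wines <- sum(quality_counts)

# Percentage of samples in each grade
quality_share <- round(100 * quality_counts / total_wines, 1)
print(quality_share)

# Combined share of the two most common grades
middle_share <- sum(quality_counts[c("5", "6")]) / total_wines
cat("Share of wines rated 5 or 6:", round(100 * middle_share, 1), "%\n")
```

With these counts, grades 5 and 6 together account for roughly three quarters of the samples, while grades 3 and 9 together contribute only 25 wines.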
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
pH is approximately normally distributed, with a spike around pH 3.16-3.18.
Alcohol is rarely above 14.0% or below 8.5%. The spike is at around 9.5%.
Chlorides is rarely above 0.06, and there is a spike around its median value. If more outliers are trimmed, it also looks like a normal distribution.
Residual sugar has two big spikes, at 1-1.5 and 1.5-2.0, and a long tail on the right side.
I transformed the residual sugar scale to log10 and square root. On the log10 scale, I saw a bimodal distribution.
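The effect of these two transformations can be illustrated with base R histograms. The sketch below does not use the wine data: `sugar` is a simulated, right-skewed mixture that I made up as a stand-in, so only the mechanics of the transformation carry over, not the exact shapes.

```r
set.seed(42)

# Simulated stand-in for residual sugar: a mix of two right-skewed groups.
# These are NOT the wine data, only an illustration of the transformation.
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(8.5), sdlog = 0.4))

# Bin the raw, square-root, and log10 values without drawing yet
raw_hist  <- hist(sugar, breaks = 30, plot = FALSE)
sqrt_hist <- hist(sqrt(sugar), breaks = 30, plot = FALSE)
log_hist  <- hist(log10(sugar), breaks = 30, plot = FALSE)

# Draw the three versions side by side to compare their shapes
old_par <- par(mfrow = c(1, 3))
plot(raw_hist,  main = "Raw",         xlab = "residual sugar")
plot(sqrt_hist, main = "Square root", xlab = "sqrt(residual sugar)")
plot(log_hist,  main = "log10",       xlab = "log10(residual sugar)")
par(old_par)
```

The log10 scale compresses the long right tail, which is why a second group of values that is hard to see on the raw scale can show up as a separate peak.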
I grouped the acidity-related attributes together to compare their distributions. Outliers are excluded in the plots above. They have similar patterns, all close to a normal distribution, especially fixed acidity.
I grouped the sulfur-related attributes together to compare their distributions as well. Outliers are excluded in the plots above. The patterns also look alike. All 3 are close to normally distributed, with a spike around their respective median values.
There are 12 features in this dataset: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality. Quality is an integer; all other features are continuous numbers. White wine quality ranges from 3 to 9, and the larger the number, the better the quality. Both the best and the worst quality wines are very few. The median quality is 6, and 2198 samples have this quality, which is 45% of the whole dataset.
The main feature of this dataset is quality. I will explore the relationships between quality and the other features and try to find out whether wine quality can be predicted from its chemical attributes.
Alcohol is the first feature to evaluate, followed by pH, fixed.acidity (and the other acidity attributes), total.sulfur.dioxide (and the other sulfur attributes), and chlorides, since they all have close to normal distributions.
I did not create any new variables in this section. However, I will create one later by converting quality to a factor so that I can use it as a categorical feature in plots.
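The conversion of quality to a factor could look like the sketch below. The stand-in data frame holds only a quality column rebuilt from the counts reported earlier, and making the factor ordered is my assumption; the column name `quality.factor` is the one used later in this report.

```r
# Stand-in data frame: only the quality column, rebuilt from the reported counts
quality_counts <- c(20, 163, 1457, 2198, 880, 175, 5)
whitewines <- data.frame(quality = rep(3:9, times = quality_counts))

# Convert the integer grade to a factor so plots treat it as categorical.
# ordered = TRUE is an assumption; an unordered factor works for plotting too.
whitewines$quality.factor <- factor(whitewines$quality,
                                    levels = 3:9, ordered = TRUE)

str(whitewines$quality.factor)
print(table(whitewines$quality.factor))
```

Listing the levels explicitly keeps all seven grades in a fixed order even when a subset of the data is missing some of them.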
Residual.sugar has a very long tail in its histogram, so I transformed it with log10 and square root to see whether I could get a distribution closer to normal. With the log10 transformation, it shows a bimodal distribution with one peak at 1.5 and the other at 8.5.
## Warning in loop_apply(n, do.ply): Removed 10 rows containing missing
## values (geom_point).
Higher quality wines tend to have higher alcohol content.
## Warning in loop_apply(n, do.ply): Removed 160 rows containing missing
## values (geom_point).
There is no visible correlation between quality and chlorides.
## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing
## values (geom_point).
Quality decreases when density increases.
## Warning in loop_apply(n, do.ply): Removed 10 rows containing missing
## values (geom_point).
I can’t see much impact of pH on wine quality, so I would like to draw a histogram of pH for each quality.
## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect
The pH distributions are very similar across qualities 4-8 (excluding qualities 3 and 9), so pH appears to have very little impact on quality.
## Warning in loop_apply(n, do.ply): Removed 61 rows containing missing
## values (geom_point).
There is no visible correlation between quality and residual sugar. Within each quality grade, however, more wines have residual sugar less than 5.
## Warning in loop_apply(n, do.ply): Removed 59 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 49 rows containing non-finite
## values (stat_boxplot).
Except for wines with quality 3 and 4, it looks like higher quality wines tend to have lower total sulfur dioxide.
## Warning in loop_apply(n, do.ply): Removed 63 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 48 rows containing non-finite
## values (stat_boxplot).
To my surprise, qualities 6, 7, and 8 have very similar median volatile acidity. It is not obvious that volatile acidity has much impact on wine quality.
After exploring the relationship between quality and 7 other major features, I continued by investigating the relationships between other feature pairs, such as density vs alcohol, density vs residual sugar, and volatile acidity vs fixed acidity.
Density decreases when alcohol increases. Density increases when residual sugar increases. And although there are some spikes, overall, residual sugar decreases as alcohol increases.
## Warning in loop_apply(n, do.ply): Removed 85 rows containing missing
## values (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 85 rows containing missing
## values (geom_point).
## Warning in loop_apply(n, do.ply): Removed 89 rows containing missing
## values (stat_smooth).
## Warning in loop_apply(n, do.ply): Removed 89 rows containing missing
## values (geom_point).
Free sulfur dioxide has a strong positive correlation with total sulfur dioxide. On the other hand, volatile acidity appears unrelated to fixed acidity.
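Pairwise relationships like these can be quantified with `cor()`. The sketch below uses only the ten preview rows shown in the `str()` output at the top of this report, re-entered by hand, so it demonstrates the call rather than reproducing the correlations for the full 4898 samples. The data frame name `preview` is mine.

```r
# First ten rows of four columns, copied from the str() output above.
# Ten rows are far too few to be representative; this only shows the mechanics.
preview <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation for two of the pairs discussed above
sulfur_cor        <- cor(preview$free.sulfur.dioxide, preview$total.sulfur.dioxide)
sugar_alcohol_cor <- cor(preview$residual.sugar, preview$alcohol)

# Full correlation matrix for the four columns
print(round(cor(preview), 2))
```

Even in this tiny preview the signs agree with the plots: free and total sulfur dioxide move together, while residual sugar and alcohol move in opposite directions.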
Quality correlates most strongly with alcohol and density. Quality also has some correlation with total sulfur dioxide.
As alcohol content increases, the quality of wine gets better, and the relationship looks linear. However, the worst quality wines (quality 3) have a higher median alcohol than qualities 4 and 5, and the quality 4 median is also higher than that of quality 5. This may be because the number of samples in qualities 3 and 4 is very small.
As density increases, the quality of wine gets worse.
As total sulfur dioxide increases, the quality of wine gets worse for wines of quality 5-9, but the correlation doesn’t look very strong.
pH, fixed acidity, and free sulfur dioxide have no visible correlation with quality, which surprises me.
Yes. I found the following interesting relationships:
Density is strongly correlated with alcohol and residual sugar. As alcohol increases, density decreases. As residual sugar increases, density increases as well. During fermentation, sugar is converted into alcohol and carbon dioxide, so this observation makes a lot of sense.
Free sulfur dioxide increases as total sulfur dioxide increases and the correlation is strong, which also makes sense.
Another surprise to me is that volatile acidity appears to have no relationship with fixed acidity. I guess they are different types of acids, which can’t be transformed from one into the other.
The strongest relationship I found is between density and residual sugar.
I added more features to my plots to observe the impact of two or more features on white wine quality. To avoid too many levels in the plots in this section, I use the subset whitewines_sub, which leaves out the best and worst quality categories because they have very few samples and can be treated as outliers.
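A subset like whitewines_sub could be built as in the sketch below. The exact filter is my assumption based on the description (dropping grades 3 and 9), and the stand-in data frame contains only a quality column rebuilt from the counts reported earlier.

```r
# Stand-in data frame: only the quality column, rebuilt from the reported counts
quality_counts <- c(20, 163, 1457, 2198, 880, 175, 5)
whitewines <- data.frame(quality = rep(3:9, times = quality_counts))

# Drop the rarest grades (3 and 9); the exact condition is an assumption
whitewines_sub <- subset(whitewines, quality > 3 & quality < 9)

print(table(whitewines_sub$quality))
print(nrow(whitewines_sub))
```

Only 25 of the 4898 samples are removed this way, so the subset keeps almost all of the data while reducing the number of quality levels in the plots from seven to five.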
## Warning in loop_apply(n, do.ply): Removed 46 rows containing missing
## values (geom_point).
The area with lower alcohol and higher total sulfur dioxide has more lower quality wines, and the area with higher alcohol and lower total sulfur dioxide has more higher quality wines.
## Warning in loop_apply(n, do.ply): Removed 44 rows containing missing
## values (geom_point).
The area with lower alcohol and higher fixed acidity has more lower quality wines, and the area with higher alcohol and lower fixed acidity has more higher quality wines.
## Warning in loop_apply(n, do.ply): Removed 47 rows containing missing
## values (geom_point).
The area with higher alcohol and lower residual sugar is densely populated with high quality wines. In the lower alcohol area, however, lower quality wines are spread almost evenly across the whole residual sugar range.
pH doesn’t have a strong correlation with alcohol. Better quality wines have more points at the higher alcohol end, but within each quality the points are spread almost evenly across pH.
## Warning in loop_apply(n, do.ply): Removed 39 rows containing missing
## values (geom_point).
Free sulfur dioxide doesn’t have a strong correlation with alcohol.
## Warning in loop_apply(n, do.ply): Removed 50 rows containing missing
## values (geom_point).
Sulphates doesn’t have a strong correlation with alcohol, and it is not correlated with quality.
I added a new parameter, total.acidity, which is the sum of fixed acidity, volatile acidity, and citric acid.
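The new column is a plain row-wise sum. The sketch below shows it on the first three rows from the `str()` output at the top of this report, re-entered by hand; the data frame name `preview` is mine.

```r
# First three rows of the acidity columns, copied from the str() output above
preview <- data.frame(
  fixed.acidity    = c(7, 6.3, 8.1),
  volatile.acidity = c(0.27, 0.3, 0.28),
  citric.acid      = c(0.36, 0.34, 0.4)
)

# total.acidity is the row-wise sum of the three acid measurements
preview$total.acidity <- with(preview,
                              fixed.acidity + volatile.acidity + citric.acid)

print(preview)
```

Because fixed acidity is an order of magnitude larger than the other two terms, the sum is dominated by it, so total.acidity should be expected to behave much like fixed.acidity in plots.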
Median total acidity is very close across all quality levels.
## Warning in loop_apply(n, do.ply): Removed 177 rows containing missing
## values (geom_point).
## $title
## [1] "pH Vs Total Acidity by Quality"
##
## attr(,"class")
## [1] "labels"
pH decreases as total acidity increases. Quality appears to have no visible correlation with pH or total acidity.
## Warning in loop_apply(n, do.ply): Removed 177 rows containing missing
## values (geom_point).
I saw a distribution similar to that of fixed acidity vs alcohol by quality, so the new parameter doesn’t seem to add much value to the analysis here.
##
## Calls:
## m1: lm(formula = quality ~ alcohol, data = whitewines)
## m2: lm(formula = quality ~ alcohol + total.sulfur.dioxide, data = whitewines)
## m3: lm(formula = quality ~ alcohol + total.sulfur.dioxide + fixed.acidity,
## data = whitewines)
## m4: lm(formula = quality ~ alcohol + total.sulfur.dioxide + residual.sugar,
## data = whitewines)
##
## =============================================================
## m1 m2 m3 m4
## -------------------------------------------------------------
## (Intercept) 2.582*** 2.419*** 2.911*** 2.048***
## (0.098) (0.133) (0.167) (0.139)
## alcohol 0.313*** 0.322*** 0.317*** 0.352***
## (0.009) (0.010) (0.010) (0.011)
## total.sulfur.dioxide 0.001 0.001* -0.000
## (0.000) (0.000) (0.000)
## fixed.acidity -0.066***
## (0.014)
## residual.sugar 0.022***
## (0.003)
## -------------------------------------------------------------
## R-squared 0.190 0.190 0.194 0.202
## adj. R-squared 0.190 0.190 0.194 0.201
## sigma 0.797 0.797 0.795 0.791
## F 1146.395 575.100 393.081 412.870
## p 0.000 0.000 0.000 0.000
## Log-likelihood -5839.391 -5837.755 -5825.920 -5802.097
## Deviance 3112.257 3110.178 3095.184 3065.221
## AIC 11684.782 11683.510 11661.839 11614.193
## BIC 11704.272 11709.496 11694.322 11646.676
## N 4898 4898 4898 4898
## =============================================================
Total sulfur dioxide and fixed acidity make the impact of alcohol on quality stronger.
As total sulfur dioxide decreases and alcohol increases, quality gets better.
As fixed acidity decreases and alcohol increases, quality gets better.
I added up the acid-related parameters to create a new parameter, total acidity. Total acidity has a strong negative correlation with pH, which makes sense.
I still can’t see much relationship between quality and pH, or between quality and free sulfur dioxide. From googling the chemical attributes of wines, I had expected those two features to correlate strongly with quality.
Yes. I created linear models of quality using alcohol and a few other features. However, R-squared is only about 0.19 to 0.20, which means the models explain only about a fifth of the variance in quality. So a simple linear model does not look like a good way to predict the quality of white wines.
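To get a feel for what the alcohol-only model m1 implies, the sketch below plugs the minimum, median, and maximum alcohol values from the summary into the fitted line, using the intercept and slope reported in the model table. The variable names are mine, and the coefficients are typed in by hand rather than taken from a fitted `lm` object.

```r
# Coefficients of m1 (quality ~ alcohol), copied from the model table above
m1_intercept <- 2.582
m1_slope     <- 0.313

# Minimum, median, and maximum alcohol from the summary() output
alcohol_values <- c(min = 8.0, median = 10.4, max = 14.2)

# Fitted line: predicted quality = intercept + slope * alcohol
predicted_quality <- m1_intercept + m1_slope * alcohol_values

print(round(predicted_quality, 2))
```

Across the whole observed alcohol range the fitted line only moves from about 5.1 to about 7.0, so this model never predicts the lowest or highest grades, which is consistent with its low R-squared.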
I used the added factor-type parameter quality.factor to redraw the histogram, which looks better than the one using quality. Quality is close to normally distributed, and the median quality accounts for almost half of the wines in the data set. As quality decreases or increases from there, the number of wines drops quickly.
## Warning in loop_apply(n, do.ply): Removed 19 rows containing non-finite
## values (stat_boxplot).
To my surprise, the worst quality wines (quality 3) have a higher median alcohol than qualities 4 and 5, and the quality 4 median is also higher than that of quality 5. This may be because the number of samples in qualities 3 and 4 is comparatively small. Or maybe other chemical components in those worst quality wines downgrade the quality even if the alcohol content is relatively high. For wines of quality better than 4, it is clear that wines with higher alcohol are more likely to have higher quality.
## Warning in loop_apply(n, do.ply): Removed 25 rows containing missing
## values (geom_point).
Higher quality white wines tend to have lower total sulfur dioxide and higher alcohol: points for low quality wines (quality 4, 5, 6) are concentrated in the area of high total sulfur dioxide and low alcohol, while points for high quality wines are concentrated in the area of low total sulfur dioxide and high alcohol.
I picked this dataset because I started to like drinking wine two years ago. I am really curious whether I can find clues in the chemical attributes that tell good wines from bad ones. This set has 4898 white wine samples, which is a good size for practice: not too big, but large enough to draw nice plots.
I started with each individual variable. Wine quality is close to normally distributed, which makes sense since most wines are of median quality and probably sell at an affordable price, so demand for this type of wine is the largest. Close to normal distributions can also be seen in the following features: fixed acidity, volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, density, and pH. So I initially thought there should be a strong linear relationship between quality and the other features.
Then I paired the other features with quality to explore their relationships. To my surprise, I only saw a strong correlation between quality and alcohol. There are visible but weaker correlations between quality and fixed acidity, and between quality and total sulfur dioxide. pH and residual sugar were my top candidates, but to my disappointment I couldn’t find any visible correlation there. I also found that alcohol is negatively correlated with both residual sugar and density, and that residual sugar and density are strongly correlated.
Finally, I added a third feature to the pairs. I did find that total sulfur dioxide and fixed acidity enhance the correlation between alcohol and quality. So I tried to build a linear model using alcohol, total sulfur dioxide, fixed acidity, pH and residual sugar. That model is not very successful, since R-squared is very low, only 0.19. That bothers me a lot.
Here are some of my thoughts about this dataset and possible improvements beyond the scope of this project:
The wine samples in this dataset all come from one particular area in Portugal, which may introduce bias related to that region.
The sample sizes for the best and worst qualities are very small, which makes modeling at the high and low ends inaccurate.
pH and residual sugar should play a role in a wine’s taste. However, I failed to find a correlation in this dataset, which tells me that something may be missing from it, or that these features could be transformed in a way that represents the relationship better.
Quality is scored on a scale from 0 to 10, and in this dataset it ranges from 3 to 9. It looks like a categorical feature to me, so a linear model may not be the best way to predict it. Other models could be used to classify wine quality by its chemical properties.